habitat 2
- North America > United States > California > Alameda County > Berkeley (0.04)
- North America > Canada (0.04)
- Asia > Japan > Honshū > Kansai > Hyogo Prefecture > Kobe (0.04)
Bring the Apple, Not the Sofa: Impact of Irrelevant Context in Embodied AI Commands on VLA Models
Pugacheva, Daria, Moskalenko, Andrey, Shepelev, Denis, Kuznetsov, Andrey, Shakhuro, Vlad, Tutubalina, Elena
Vision Language Action (VLA) models are widely used in Embodied AI, enabling robots to interpret and execute language instructions. However, their robustness to natural language variability in real-world scenarios has not been thoroughly investigated. In this work, we present a novel systematic study of the robustness of state-of-the-art VLA models under linguistic perturbations. Specifically, we evaluate model performance under two types of instruction noise: (1) human-generated paraphrasing and (2) the addition of irrelevant context. We further categorize irrelevant contexts into two groups according to their length and their semantic and lexical proximity to robot commands. In this study, we observe consistent performance degradation as context size expands. We also demonstrate that the model can exhibit relative robustness to random context, with a performance drop within 10%, while semantically and lexically similar context of the same length can trigger a quality decline of around 50%. Human paraphrases of instructions lead to a drop of nearly 20%. To mitigate this, we propose an LLM-based filtering framework that extracts core commands from noisy inputs. Incorporating our filtering step allows models to recover up to 98.5% of their original performance under noisy conditions.
- Leisure & Entertainment (0.46)
- Information Technology (0.46)
- Information Technology > Artificial Intelligence > Robots (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)
- North America > United States > California > Alameda County > Berkeley (0.04)
- North America > Canada (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- (2 more...)
- Information Technology > Artificial Intelligence > Robots (1.00)
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.46)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (0.41)
Structured Preference Optimization for Vision-Language Long-Horizon Task Planning
Liang, Xiwen, Lin, Min, Ruan, Weiqi, Xu, Rongtao, Liu, Yuecheng, Chen, Jiaqi, Lin, Bingqian, Zhuang, Yuzheng, Liang, Xiaodan
Existing methods for vision-language task planning excel in short-horizon tasks but often fall short in complex, long-horizon planning within dynamic environments. These challenges primarily arise from the difficulty of effectively training models to produce high-quality reasoning processes for long-horizon tasks. To address this, we propose Structured Preference Optimization (SPO), which aims to enhance reasoning and action selection in long-horizon task planning through structured preference evaluation and optimized training strategies. Specifically, SPO introduces: 1) Preference-Based Scoring and Optimization, which systematically evaluates reasoning chains based on task relevance, visual grounding, and historical consistency; and 2) Curriculum-Guided Training, where the model progressively adapts from simple to complex tasks, improving its generalization ability in long-horizon scenarios and enhancing reasoning robustness. To advance research in vision-language long-horizon task planning, we introduce ExtendaBench, a comprehensive benchmark covering 1,509 tasks across VirtualHome and Habitat 2.0, categorized into ultra-short, short, medium, and long tasks. Experimental results demonstrate that SPO significantly improves reasoning quality and final decision accuracy, outperforming prior methods on long-horizon tasks and underscoring the effectiveness of preference-driven optimization in vision-language task planning. Specifically, SPO achieves a +5.98% GCR and +4.68% SR improvement in VirtualHome and a +3.30% GCR and +2.11% SR improvement in Habitat over the best-performing baselines.
Habitat 2.0: Training Home Assistants to Rearrange their Habitat
We introduce Habitat 2.0 (H2.0), a simulation platform for training virtual robots in interactive 3D environments and complex physics-enabled scenarios. We make comprehensive contributions to all levels of the embodied AI stack – data, simulation, and benchmark tasks. These large-scale engineering contributions allow us to systematically compare deep reinforcement learning (RL) at scale and classical sense-plan-act (SPA) pipelines in long-horizon structured tasks, with an emphasis on generalization to new objects, receptacles, and layouts. We find that (1) flat RL policies struggle on HAB compared to hierarchical ones; (2) a hierarchy with independent skills suffers from'hand-off problems', and (3) SPA pipelines are more brittle than RL policies.
Learning Human Preferences Over Robot Behavior as Soft Planning Constraints
Narcomey, Austin, Tsoi, Nathan, Desai, Ruta, Vázquez, Marynel
Preference learning has long been studied in Human-Robot Interaction (HRI) in order to adapt robot behavior to specific user needs and desires. Typically, human preferences are modeled as a scalar function; however, such a formulation confounds critical considerations on how the robot should behave for a given task, with desired -- but not required -- robot behavior. In this work, we distinguish between such required and desired robot behavior by leveraging a planning framework. Specifically, we propose a novel problem formulation for preference learning in HRI where various types of human preferences are encoded as soft planning constraints. Then, we explore a data-driven method to enable a robot to infer preferences by querying users, which we instantiate in rearrangement tasks in the Habitat 2.0 simulator. We show that the proposed approach is promising at inferring three types of preferences even under varying levels of noise in simulated user choices between potential robot behaviors. Our contributions open up doors to adaptable planning-based robot behavior in the future.
Galactic: Scaling End-to-End Reinforcement Learning for Rearrangement at 100k Steps-Per-Second
Berges, Vincent-Pierre, Szot, Andrew, Chaplot, Devendra Singh, Gokaslan, Aaron, Mottaghi, Roozbeh, Batra, Dhruv, Undersander, Eric
We present Galactic, a large-scale simulation and reinforcement-learning (RL) framework for robotic mobile manipulation in indoor environments. Specifically, a Fetch robot (equipped with a mobile base, 7DoF arm, RGBD camera, egomotion, and onboard sensing) is spawned in a home environment and asked to rearrange objects - by navigating to an object, picking it up, navigating to a target location, and then placing the object at the target location. Galactic is fast. In terms of simulation speed (rendering + physics), Galactic achieves over 421,000 steps-per-second (SPS) on an 8-GPU node, which is 54x faster than Habitat 2.0 (7699 SPS). More importantly, Galactic was designed to optimize the entire rendering + physics + RL interplay since any bottleneck in the interplay slows down training. In terms of simulation+RL speed (rendering + physics + inference + learning), Galactic achieves over 108,000 SPS, which 88x faster than Habitat 2.0 (1243 SPS). These massive speed-ups not only drastically cut the wall-clock training time of existing experiments, but also unlock an unprecedented scale of new experiments. First, Galactic can train a mobile pick skill to >80% accuracy in under 16 minutes, a 100x speedup compared to the over 24 hours it takes to train the same skill in Habitat 2.0. Second, we use Galactic to perform the largest-scale experiment to date for rearrangement using 5B steps of experience in 46 hours, which is equivalent to 20 years of robot experience. This scaling results in a single neural network composed of task-agnostic components achieving 85% success in GeometricGoal rearrangement, compared to 0% success reported in Habitat 2.0 for the same approach. The code is available at github.com/facebookresearch/galactic.
Habitat 2.0: Training home assistants to rearrange their habitat
The Habitat 2.0 simulation platform is for training virtual robots in interactive, physics-enabled 3D environments. We present the new ReplicaCAD 3D data set of apartments; the Habitat 2.0 simulator, capable of simulating 850x faster than real time; and the Home Assistant Benchmark, a suite of common tasks for assistive robots.
Facebook updates Habitat environment to train 'embodied AI'
Where does your enterprise stand on the AI adoption curve? Take our AI survey to find out. In 2019, Facebook open-sourced AI Habitat, a simulator that can train AI systems embodying things like a home robot to operate in environments meant to mimic real-world settings, like apartments and offices. Today Facebook announced that it's extended the capabilities of Habitat to make it "orders of magnitude" faster than other 3D simulators available, allowing researchers to perform more complex tasks in simulation, like setting the table and stocking the fridge. Coinciding with this, Facebook collaborated with 3D space capture company Matterport to open-source what it claims is the largest dataset of indoor 3D scans to date.